arXiv:1505.03540v3 [cs.CV] 20 May 2016 


Brain Tumor Segmentation with Deep Neural Networks^ 


Mohammad Havaei^’\ Axel Davy^, David Warde-Farley^, Antoine Biard^’^, Aaron Courville^, Yoshua Bengio^, Chris Pah’^, 

Pierre-Marc Jodoin^, Hugo Larochelle^’^ 

^ Universite de Sherbrooke, Sherbrooke, Qc, Canada 
^Ecole Normale superieure, Paris, France 
^ Universite de Montreal, Montreal, Canada 
^Ecole poly technique, Palaiseau, France 
^Ecole Polytechnique de Montreal, Canada 
^Twitter, USA 


Abstract 

In this paper, we present a fully automatic brain tumor segmentation method based on Deep Neural Networks (DNNs). The 
proposed networks are tailored to glioblastomas (both low and high grade) pictured in MR images. By their very nature, these 
tumors can appear anywhere in the brain and have almost any kind of shape, size, and contrast. These reasons motivate our 
exploration of a machine learning solution that exploits a flexible, high capacity DNN while being extremely efficient. Here, we 
give a description of different model choices that we’ve found to be necessary for obtaining competitive performance. We explore 
in particular different architectures based on Convolutional Neural Networks (CNN), i.e. DNNs specifically adapted to image data. 

We present a novel CNN architecture which differs from those traditionally used in computer vision. Our CNN exploits both local 
features as well as more global contextual features simultaneously. Also, different from most traditional uses of CNNs, our networks 
use a final layer that is a convolutional implementation of a fully connected layer which allows a 40 fold speed up. We also describe 
a 2-phase training procedure that allows us to tackle difficulties related to the imbalance of tumor labels. Finally, we explore a 
cascade architecture in which the output of a basic CNN is treated as an additional source of information for a subsequent CNN. 
Results reported on the 2013 BRATS test dataset reveal that our architecture improves over the currently published state-of-the-art 
while being over 30 times faster. 
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1. Introduction 

In the United States alone, it is estimated that 23,000 new 
cases of brain cancer will be diagnosed in 201f|^ While 
gliomas are the most common brain tumors, they can be less 
aggressive (i.e. low grade) in a patient with a life expectancy of 
several years, or more aggressive (i.e. high grade) in a patient 
with a life expectancy of at most 2 years. 

Although surgery is the most common treatment for brain 
tumors, radiation and chemotherapy may be used to slow the 
growth of tumors that cannot be physically removed. Magnetic 
resonance imaging (MRI) provides detailed images of the brain, 
and is one of the most common tests used to diagnose brain tu¬ 
mors. All the more, brain tumor segmentation from MR im¬ 
ages can have great impact for improved diagnostics, growth 
rate prediction and treatment planning. 

While some tumors such as meningiomas can be easily seg¬ 
mented, others like gliomas and glioblastomas are much more 
difficult to localize. These tumors (together with their surround¬ 
ing edema) are often diffused, poorly contrasted, and extend 
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tentacle-like structures that make them difficult to segment. An¬ 
other fundamental difficulty with segmenting brain tumors is 
that they can appear anywhere in the brain, in almost any shape 
and size. Furthermore, unlike images derived from X-ray com¬ 
puted tomography (CT) scans, the scale of voxel values in MR 
images is not standardized. Depending on the type of MR ma¬ 
chine used (1.5, 3 or 7 tesla) and the acquisition protocol (field 
of view value, voxel resolution, gradient strength, bO value, 
etc.), the same tumorous cells may end up having drastically 
different grayscale values when pictured in different hospitals. 

Healthy brains are typically made of 3 types of tissues: the 
white matter, the gray matter, and the cerebrospinal fluid. The 
goal of brain tumor segmentation is to detect the location and 
extension of the tumor regions, namely active tumorous tissue 
(vascularized or not), necrotic tissue, and edema (swelling near 
the tumor). This is done by identifying abnormal areas when 
compared to normal tissue. Since glioblastomas are infiltra¬ 
tive tumors, their borders are often fuzzy and hard to distin¬ 
guish from healthy tissues. As a solution, more than one MRI 
modality is often employed, e.g. T1 (spin-lattice relaxation), 
Tl-contrasted (TIC), T2 (spin-spin relaxation), proton density 
(PD) contrast imaging, diffusion MRI (dMRI), and fluid atten¬ 
uation inversion recovery (FLAIR) pulse sequences. The con- 
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trast between these modalities gives almost a unique signature 
to each tissue type. 

Most automatic brain tumor segmentation methods use hand- 
designed features El ED. These methods implement a clas¬ 
sical machine learning pipeline according to which features 
are first extracted and then given to a classifier whose train¬ 
ing procedure does not affect the nature of those features. An 
alternative approach for designing task-adapted feature repre¬ 
sentations is to learn a hierarchy of increasingly complex fea¬ 
tures directly from in-domain data. Deep neural networks have 
been shown to excel at learning such feature hierarchies 171. In 
this work, we apply this approach to learn feature hierarchies 
adapted specifically to the task of brain tumor segmentation that 
combine information across MRI modalities. 

Specifically, we investigate several choices for training Con¬ 
volutional Neural Networks (CNNs), which are Deep Neural 
Networks (DNNs) adapted to image data. We report their 
advantages, disadvantages and performance using well estab¬ 
lished metrics. Although CNNs first appeared over two decades 
ago |[29l . they have recently become a mainstay of the com¬ 
puter vision community due to their record-shattering perfor¬ 
mance in the ImageNet Large-Scale Visual Recognition Chal¬ 
lenge Czl While CNNs have also been successfully applied 
to segmentation problems dEUEIlIEl, most of the previous 
work has focused on non-medical tasks and many involve ar¬ 
chitectures that are not well suited to medical imagery or brain 
tumor segmentation in particular. Our preliminary work on us¬ 
ing convolutional neural networks for brain tumor segmentation 
together with two other methods using CNNs was presented in 
BRATS N4 workshop. However, those results were incomplete 
and required more investigation (More on this in chapter |^. 

In this paper, we propose a number of specific CNN architec¬ 
tures for tackling brain tumor segmentation. Our architectures 
exploit the most recent advances in CNN design and training 
techniques, such as Maxout El hidden units and Dropout ia 
regularization. We also investigate several architectures which 
take into account both the local shape of tumors as well as their 
context. 

One problem with many machine learning methods is that 
they perform pixel classification without taking into account the 
local dependencies of labels (i.e. segmentation labels are con¬ 
ditionally independent given the input image). To account for 
this, one can employ structured output methods such as condi¬ 
tional random fields (CRTs), for which inference can be com¬ 
putationally expensive. Alternatively, one can model label de¬ 
pendencies by considering the pixel-wise probability estimates 
of an initial CNN as additional input to certain layers of a sec¬ 
ond DNN, forming a cascaded architecture. Since convolutions 
are efficient operations, this approach can be significantly faster 
than implementing a CRF. 

We focus our experimental analysis on the fully-annotated 
MICCAI brain tumor segmentation (BRATS) challenge 2013 
dataset El using the well defined training and testing splits, 
thereby allowing us to compare directly and quantitatively to a 
wide variety of other methods. 

Our contributions in this work are four fold: 


1. We propose a fully automatic method with results cur¬ 
rently ranked second on the BRATS 2013 scoreboard; 

2. To segment a brain, our method takes between 25 seconds 
and 3 minutes, which is one order of magnitude faster than 
most state-of-the-art methods. 

3. Our CNN implements a novel two-pathway architecture 
that learns about the local details of the brain as well as 
the larger context. We also propose a two-phase training 
procedure which we have found is critical to deal with im¬ 
balanced label distributions. Details of these contributions 
are described in Sections [3.1.11 and [T2l 

4. We employ a novel cascaded architecture as an efficient 
and conceptually clean alternative to popular structured 
output methods. Details on those models are presented 
in Section [ 3 X 2 I 

2. Related work 

As noted by Menze et al. EZl, the number of publications 
devoted to automated brain tumor segmentation has grown ex¬ 
ponentially in the last several decades. This observation not 
only underlines the need for automatic brain tumor segmenta¬ 
tion tools, but also shows that research in that area is still a work 
in progress. 

Brain tumor segmentation methods (especially those devoted 
to MRI) can be roughly divided in two categories: those based 
on generative models and those based on discriminative mod- 

els (usia. 

Generative models rely heavily on domain-specific prior 
knowledge about the appearance of both healthy and tumor¬ 
ous tissues. Tissue appearance is challenging to characterize, 
and existing generative models usually identify a tumor as be¬ 
ing a shape or a signal which deviates from a normal (or av¬ 
erage) brain (91. Typically, these methods rely on anatomical 
models obtained after aligning the 3D MR image on an at¬ 
las or a template computed from several healthy brains (TTl . 
A typical generative model of MR brain images can be found 
in Prastawa et al. Ell. Given the ICBM brain atlas, the method 
aligns the brain to the atlas and computes posterior probabilities 
of healthy tissues (white matter, gray matter and cerebrospinal 
fluid) . Tumorous regions are then found by localizing vox¬ 
els whose posterior probability is below a certain threshold. A 
post-processing step is then applied to ensure good spatial reg¬ 
ularity. Prastawa et al. (381 also register brain images onto an 
atlas in order to get a probability map for abnormalities. An 
active contour is then initialized on this map and iterated until 
the change in posterior probability is below a certain threshold. 
Many other active-contour methods along the same lines have 
been proposed EiMEa, all of which depend on left-right 
brain symmetry features and/or alignment-based features. Note 
that since aligning a brain with a large tumor onto a template 
can be challenging, some methods perform registration and tu¬ 
mor segmentation at the same time (28l[34l . 

Other approaches for brain tumor segmentation employ dis¬ 
criminative models. Unlike generative modeling approaches, 
these approaches exploit little prior knowledge on the brain’s 
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anatomy and instead rely mostly on the extraction of [a large 
number of] low level image features, directly modeling the 
relationship between these features and the label of a given 
voxel. These features may be raw input pixels values 1221 [201 . 
local histograms (261 ESI texture features such as Gabor fil- 
terbanks (44l [43l, or alignment-based features such as inter¬ 
image gradient, region shape difference, and symmetry anal¬ 
ysis |[33l . Classical discriminative learning techniques such 
as SVMs (H [m [30l and decision forests ma have also been 
used. Results from the 2012, 2013 and 2014 editions of the 
MICCAI-BRATS Challenge suggest that methods relying on 
random forests are among the most accurate 

One common aspect with discriminative models is their im¬ 
plementation of a conventional machine learning pipeline rely¬ 
ing on hand-designed features. For these methods, the classifier 
is trained to separate healthy from non-heatlthy tissues assum¬ 
ing that the input features have a sufficiently high discriminative 
power since the behavior the classifier is independent from na¬ 
ture of those features. One difficulty with methods based on 
hand-designed features is that they often require the computa¬ 
tion of a large number of features in order to be accurate when 
used with many traditional machine learning techniques. This 
can make them slow to compute and expensive memory-wise. 
More efficient techniques employ lower numbers of features, 
using dimensionality reduction or feature selection methods, 
but the reduction in the number of features is often at the cost 
of reduced accuracy. 

By their nature, many hand-engineered features exploit very 
generic edge-related information, with no specific adaptation to 
the domain of brain tumors. Ideally, one would like to have 
features that are composed and refined into higher-level, task- 
adapted representations. Recently, preliminary investigations 
have shown that the use of deep CNNs for brain tumor segmen¬ 
tation makes for a very promising approach (see the BRATS 
2014 challenge workshop papers of Davy et al. ifTTIl . Zikic et al. 
(491 . Urban et al. (45l ). All three methods divide the 3D MR 
images into 2D (III1491 or 3D patches (45l and train a CNN to 
predict its center pixel class. Urban et al. (H as well as Zikic 
et al. (49l implemented a fairly common CNN, consisting of a 
series of convolutional layers, a non-linear activation function 
between each layer and a softmax output layer. Our work her^ 
extends our preliminary results presented in Davy et al. m us¬ 
ing a two-pathway architecture, which we use here as a building 
block. 

In computer vision, CNN-based segmentation models have 
typically been applied to natural scene labeling. For these tasks, 
the inputs to the model are the RGB channels of a patch from 
a color image. The work in Pinheiro and Collobert (351 uses a 
basic CNN to make predictions for each pixel and further im¬ 
proves the predictions by using them as extra information in 
the input of a second CNN model. Other work CD involves 
several distinct CNNs processing the image at different resolu¬ 
tions. The final per-pixel class prediction is made by integrating 
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information learned from all CNNs. To produce a smooth seg¬ 
mentation, these predictions are regularized using a more global 
superpixel segmentation of the image. Like our work, other re¬ 
cent work has exploited convolution operations in the final layer 
of a network to extend traditional CNN architectures for seman¬ 
tic scene segmentation ED. In the medical imaging domain in 
general there has been comparatively less work using CNNs for 
segmentation. However, some notable recent work by Huang 
and Jain (231 has used CNNs to predict the boundaries of neu¬ 
ral tissue in electron microscopy images. Here we explore an 
approach with similarities to the various approaches discussed 
above, but in the context of brain tumor segmentation. 

3. Our Convolutional Neural Network Approach 

Since the brains in the BRATS dataset lack resolution in the 
third dimension, we consider performing the segmentation slice 
by slice from the axial view. Thus, our model processes sequen¬ 
tially each 2D axial image (slice) where each pixel is associ¬ 
ated with different image modalities namely; Tl, T2, TIC and 
FLAIR. Like most CNN-based segmentation models (35l[T3l, 
our method predicts the class of a pixel by processing the MxM 
patch centered on that pixel. The input X of our CNN model is 
thus an M X M 2D patch with several modalities. 

The main building block used to construct a CNN architec¬ 
ture is the convolutional layer. Several layers can be stacked 
on top of each other forming a hierarchy of features. Each 
layer can be understood as extracting features from its preced¬ 
ing layer into the hierarchy to which it is connected. A single 
convolutional layer takes as input a stack of input planes and 
produces as output some number of output planes or feature 
maps. Each feature map can be thought of as a topologically 
arranged map of responses of a particular spatially local non¬ 
linear feature extractor (the parameters of which are learned), 
applied identically to each spatial neighborhood of the input 
planes in a sliding window fashion. In the case of a first con¬ 
volutional layer, the individual input planes correspond to dif¬ 
ferent MRI modalities (in typical computer vision applications, 
the individual input planes correspond to the red, green and blue 
color channels). In subsequent layers, the input planes typically 
consist of the feature maps of the previous layer. 

Computing a feature map in a convolutional layer (see Fig¬ 
ure El) consists of the following three steps: 

1. Convolution of kernels (filters): Each feature map is 
associated with one kernel (or several, in the case of Max- 
out). The feature map is computed as follows: 

= + ( 1 ) 

r 

where is the input channel, is the sub-kernel 
for that channel, * is the convolution operation and bg is 
a bias ternj^ In other words, the affine operation being 


^Since the convolutional layer is associated to R input channels, X contains 
M X M X R gray-scale values and thus each kernel contains N x N x R 
weights. Accordingly, the number of parameters in a convolutional block of 
consisting of S feature maps is equal to RxMxMxS. 
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Figure 1: A single convolution layer block showing computa¬ 
tions for a single feature map. The input patch (here 7 x 7), is 
convolved with series of kernels (here 3x3) followed by Max¬ 
out and max-pooling. 

performed for each feature map is the sum of the appli¬ 
cation of R different 2-dimensional N x N convolution 
filters (one per input channel/modality), plus a bias term 
which is added pixel-wise to each resulting spatial posi¬ 
tion. Though the input to this operation is sl M x M x R 
3-dimensional tensor, the spatial topology being consid¬ 
ered is 2-dimensional in the X-Y axial plane of the original 
brain volume. 

Whereas traditional image feature extraction methods rely 
on a fixed recipe (sometimes taking the form of convo¬ 
lution with a linear e.g. Gabor filter bank), the key to 
the success of convolutional neural networks is their abil¬ 
ity to learn the weights and biases of individual feature 
maps, giving rise to data-driven, customized, task-specific 
dense feature extractors. These parameters are adapted 
via stochastic gradient descent on a surrogate loss function 
related to the misclassification error, with gradients com¬ 
puted efficiently via the backpropagation algorithm ioi. 
Special attention must be paid to the treatment of border 
pixels by the convolution operation. Throughout our archi¬ 
tecture, we employ the so-called valid-mode convolution, 
meaning that the filter response is not computed for pixel 
positions that are less than \_NI2\ pixels away from the im¬ 
age border. An NxN filter convolved with an MxM input 
patch will result in a 2 x 2 output, where Q = M - N 1. 
In Figure M = 1, N = 3 and thus Q = 5. Note that the 
size (spatial width and height) of the kernels are hyper¬ 
parameters that must be specified by the user. 

2. Non-linear activation function: To obtain features that are 
non-linear transformations of the input, an element-wise 
non-linearity is applied to the result of the kernel convo¬ 
lution. There are multiple choices for this non-linearity, 
such as the sigmoid, hyperbolic tangent and rectified lin¬ 
ear functions m, csi- 

Recently, Goodfellow et al. na proposed a Maxout non¬ 
linearity, which has been shown to be particularly effec¬ 
tive at modeling useful features. Maxout features are as¬ 
sociated with multiple kernels W^. This implies each 
Maxout map is associated with K feature maps : 
{O^, 0^+1,..., 0^+/^_i}. Note that in Figurethe Maxout 
maps are associated with K = 2 feature maps. Maxout fea¬ 


tures correspond to taking the max over the feature maps 
O, individually for each spatial position: 

~ •••■> ^s+K-l,i,j^ (2) 

where /, j are spatial positions. Maxout features are thus 
equivalent to using a convex activation function, but whose 
shape is adaptive and depends on the values taken by the 
kernels. 

3. Max pooling: This operation consists of taking the maxi¬ 
mum feature (neuron) value over sub-windows within each 
feature map. This can be formalized as follows: 

= m^isZs^i+pj+p, (3) 

where p determines the max pooling window size. The 
sub-windows can be overlapping or not (Figure shows 
an overlapping configuration). The max-pooling operation 
shrinks the size of the feature map. This is controlled by 
the pooling size p and the stride hyper-parameter, which 
corresponds to the horizontal and vertical increments at 
which pooling sub-windows are positioned. Let S be the 
stride value and 2 x 2 be the shape of the feature map 
before max-pooling. The output of the max-pooling oper¬ 
ation would be of size D x D, where D = (Q - p)/S 1. 
In Figuresince Q = 5, p = 2,S = 1, the max-pooling 
operation results into a D = 4 output feature map. The 
motivation for this operation is to introduce invariance to 
local translations. This subsampling procedure has been 
found beneficial in other applications lITTl . 

Convolutional networks have the ability to extract a hierar¬ 
chy of increasingly complex features which makes them very 
appealing. This is done by treating the output feature maps of a 
convolutional layer as input channels to the subsequent convo¬ 
lutional layer. 

From the neural network perspective, feature maps corre¬ 
spond to a layer of hidden units or neurons. Specifically, each 
coordinate within a feature map corresponds to an individual 
neuron, for which the size of its receptive field corresponds to 
the kernel’s size. A kernel’s value also represents the weights of 
the connections between the layer’s neurons and the neurons in 
the previous layer. It is often found in practice that the learned 
kernels resemble edge detectors, each kernel being tuned to a 
different spatial frequency, scale and orientation, as is appropri¬ 
ate for the statistics of the training data. 

Finally, to perform a prediction of the segmentation labels, 
we connect the last convolutional hidden layer to a convolu¬ 
tional output layer followed by a non-linearity (i.e. no pool¬ 
ing is performed). It is necessary to note that, for segmenta¬ 
tion purposes, a conventional CNN will not yield an efficient 
test time since the output layer is typically fully connected. 
By using a convolution at the end, for which we have an ef¬ 
ficient implementation, the prediction at test time for a whole 
brain will be 45 times faster. The convolution uses as many 
kernels as there are different segmentation labels (in our case 
five). Each kernel thus acts as the ultimate detector of tissue 
from one of the segmentation labels. We use the softmax non¬ 
linearity which normalizes the result of the kernel convolutions 
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into a multinominal distribution over the labels. Specifically, let 
a be the vector of values at a given spatial position, it computes 
softmax(a) = exp(a)/Z where Z = Zc^^P(^c) is a normaliza¬ 
tion constant. More details will be discussed in Section 2] 

Noting Y as the segmentation label field over the input patch 
X, we can thus interpret each spatial position of the convolu¬ 
tional output layer as providing a model for the likelihood dis¬ 
tribution p(Yij\S.), where Ytj is the label at position /, j. We 
get the probability of all labels simply by taking the product 
of each conditional /7(Y|X) = Ylij Our approach thus 

performs a multiclass labeling by assigning to each pixel the 
label with the largest probability. 

3.1. The Architectures 

Our description of CNNs so far suggests a simple architec¬ 
ture corresponding to a single stack of several convolutional 
layers. This configuration is the most commonly implemented 
architecture in the computer vision literature. However, one 
could imagine other architectures that might be more appropri¬ 
ate for the task at hand. 

In this work, we explore a variety of architectures by using 
the concatenation of feature maps from different layers as an¬ 
other operation when composing CNNs. This operation allows 
us to construct architectures with multiple computational paths, 
which can each serve a different purpose. We now describe the 
two types of architectures that we explore in this work. 

3.1.1. Two-pathway architecture 

This architecture is made of two streams: a pathway with 
smaller 7x7 receptive fields and another with larger 13 x 13 
receptive fields. We refer to these streams as the local pathway 
and the global pathway, respectively. The motivation for this 
architectural choice is that we would like the prediction of the 
label of a pixel to be influenced by two aspects: the visual de¬ 
tails of the region around that pixel and its larger “context”, i.e. 
roughly where the patch is in the brain. 

The full architecture along with its details is illustrated in 
Figure We refer to this architecture as the TwoPathCNN. 
To allow for the concatenation of the top hidden layers of both 
pathways, we use two layers for the local pathway, with 3x3 
kernels for the second layer. While this implies that the effective 
receptive field of features in the top layer of each pathway is the 
same, the global pathway’s parametrization more directly and 
flexibly models features in that same area. The concatenation 
of the feature maps of both pathways is then fed to the output 
layer. 

3.1.2. Cascaded architectures 

One disadvantage of the CNNs described so far is that they 
predict each segmentation label separately from each other. 
This is unlike a large number of segmentation methods in the 
literature, which often propose a joint model of the segmen¬ 
tation labels, effectively modeling the direct dependencies be¬ 
tween spatially close labels. One approach is to define a con¬ 
ditional random field (CRF) over the labels and perform mean- 
field message passing inference to produce a complete segmen¬ 


tation. In this case, the final label at a given position is effec¬ 
tively influenced by the models beliefs about what the label is 
in the vicinity of that position. 

On the other hand, inference in such joint segmentation 
methods is typically more computationally expensive than a 
simple feed-forward pass through a CNN. This is an impor¬ 
tant aspect that one should take into account if automatic brain 
tumor segmentation is to be used in a day-to-day practice. 

Here, we describe CNN architectures that both exploit the 
efficiency of CNNs, while also more directly model the depen¬ 
dencies between adjacent labels in the segmentation. The idea 
is simple: since we’d like the ultimate prediction to be influ¬ 
enced by the model’s beliefs about the value of nearby labels, 
we propose to feed the output probabilities of a first CNN as 
additional inputs to the layers of a second CNN. Again, we do 
this by relying on the concatenation of convolutional layers. In 
this case, we simply concatenate the output layer of the first 
CNN with any of the layers in the second CNN. Moreover, we 
use the same two-pathway structure for both CNNs. This effec¬ 
tively corresponds to a cascade of two CNNs, thus we refer to 
such models as cascaded architectures. 

In this work, we investigated three cascaded architectures 
that concatenate the first CNN’s output at different levels of the 
second CNN: 

• Input concatenation: In this architecture, we provide the 
first CNN’s output directly as input to the second CNN. 
They are thus simply treated as additional image channels 
of the input patch. The details are illustrated in Figure 
We refer to this model as InputCascadeCNN. 

• Local pathway concatenation: In this architecture, we 
move up one layer in the local pathway and perform con¬ 
catenation to its first hidden layer, in the second CNN. The 
details are illustrated in Figure We refer to this model 
as LocalCascadeCNN. 

• Pre-output concatenation: In this last architecture, we 
move to the very end of the second CNN and perform con¬ 
catenation right before its output layer. This architecture 
is interesting, as it is similar to the computations made by 
one pass of mean-field inference ll46l in a CRF whose pair¬ 
wise potential functions are the weights in the output ker¬ 
nels. From this view, the output of the first CNN is the 
first iteration of mean-field, while the output of the sec¬ 
ond CNN would be the second iteration. The difference 
with regular mean-field however is that our CNN allows 
the output at one position to be influenced by its previ¬ 
ous value, and the convolutional kernels are not the same 
in the first and second CNN. The details are illustrated in 
Figure [3cl We refer to this model as MFCascadeCNN. 

3.2. Training 

Gradient Descent. By interpreting the output of the convolu¬ 
tional network as a model for the distribution over segmentation 
labels, a natural training criteria is to maximize the probability 
of all labels in our training set or, equivalently, to minimize the 
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Figure 2: Two-pathway CNN architecture (TwoPathCNN). The figure shows the input patch going through two paths of convolu¬ 
tional operations. The feature-maps in the local and global paths are shown in yellow and orange respectively. The convolutional 
layers used to produce these feature-maps are indicated by dashed lines in the figure. The green box embodies the whole model 
which in later architectures will be used to indicate the TwoPathCNN. 


negative log-probability -logp(Y|X) = 2/; 
each labeled brain. 

To do this, we follow a stochastic gradient descent ap¬ 
proach by repeatedly selecting labels Yij at a random subset 
of patches within each brain, computing the average negative 
log-probabilities for this mini-batch of patches and performing 
a gradient descent step on the CNNs parameters (i.e. the kernels 
at all layers). 

Performing updates based only on a small subset of patches 
allows us to avoid having to process a whole brain for each 
update, while providing reliable enough updates for learning. 
In practice, we implement this approach by creating a dataset 
of mini-batches of smaller brain image patches, paired with the 
corresponding center segmentation label as the target. 

To further improve optimization, we implemented a so-called 
momentum strategy which has been shown successful in the 
past ED. The idea of momentum is to use a temporally av¬ 
eraged gradient in order to damp the optimization velocity: 

V/+1 = yU * V/ - Of * VW/ 

W ,-+1 = W ,- + V ,>1 

where W/ stands for the CNNs parameters at iteration /, VW/ 
the gradient of the loss function at W/, V is the integrated ve¬ 
locity initialized at zero, a is the learning rate, and ji the mo¬ 
mentum coefficient. We define a schedule for the momentum ju 
where the momentum coefficient is gradually increased during 
training. In our experiments the initial momentum coefficient 
was set to yu = 0.5 and the final value was set to yu = 0.9. 

Also, the learning rate a is decreased by a factor at every 
epoch. The initial learning rate was set to a = 0.005 and the 
decay factor to 10“^ 

Two-phase training. Brain tumor segmentation is a highly data 
imbalanced problem where the healthy voxels (i.e. label 0) 
comprise 98% of total voxels. From the remaining 2% patho¬ 
logical voxels, 0.18% belongs to necrosis (label 1), 1.1% to 
edema (label 2), 0.12% to non-enhanced (label 3) and 0.38% to 
enhanced tumor (label 4). Selecting patches from the true dis¬ 
tribution would cause the model to be overwhelmed by healthy 


patches and causing problem when training out CNN models. 
Instead, we initially construct our patches dataset such that all 
labels are equiprobable. This is what we call the first training 
phase. Then, in a second phase, we account for the un-balanced 
nature of the data and re-train only the output layer (i.e. keep¬ 
ing the kernels of all other layers fixed) with a more represen¬ 
tative distribution of the labels. This way we get the best of 
both worlds: most of the capacity (the lower layers) is used in 
a balanced way to account for the diversity in all of the classes, 
while the output probabilities are calibrated correctly (thanks to 
the re-training of the output layer with the natural frequencies 
of classes in the data). 

Regularization. Successful CNNs tend to be models with a lot 
of capacity, making them vulnerable to overfitting in a setting 
like ours where there clearly are not enough training examples. 
Accordingly, we found that regularization is important in ob¬ 
taining good results. Here, regularization took several forms. 
First, in all layers, we bounded the absolute value of the kernel 
weights and applied both LI and L2 regularization to prevent 
overfitting. This is done by adding the regularization terms 
to the negative log-probability (i.e. -logp(Y|X) -r /Ii||W||i -r 
d 2 l|W|p, where Ti and A 2 are coefficients for LI and L2 regu¬ 
larization terms respectively). LI and L2 affect the parameters 
of the model in different ways, while LI encourages sparsity, L2 
encourages small values. We also used a validation set for early 
stopping, i.e. stop training when the validation performance 
stopped improving. The validation set was also used to tune the 
other hyper-parameters of the model. The reader shall note that 
the hyper-parameters of the model which includes using or not 
L2 and/or LI coefficients were selected by doing a grid search 
over range of parameters. The chosen hyper-parameters were 
the ones for which the model performed best on a validation 
set. 

Moreover, we used Dropout sa, a recent regularization 
method that works by stochastically adding noise in the com¬ 
putation of the hidden layers of the CNN. This is done by mul¬ 
tiplying each hidden or input unit by 0 (i.e. masking) with a 
certain probability (e.g. 0.5), independently for each unit and 
training update. This encourages the neural network to learn 
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(a) Cascaded architecture, using input concatenation (InputCascadeCNN). 



(b) Cascaded architecture, using local pathway concatenation (LocalCascadeCNN). 



(c) Cascaded architecture, using pre-output concatenation, which is an architecture with properties similar to that of learning using a limited number of 
mean-field inference iterations in a CRF (MFCascadeCNN). 


Figure 3: Cascaded architectures. 
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features that are useful “on their own”, since each unit cannot 
assume that other units in the same layer won’t be masked as 
well and co-adapt its behavior. At test time, units are instead 
multiplied by one minus the probability of being masked. For 
more details, see Srivastava et al. ia. 

Considering the large number of parameters our model has, 
one might think that even with our regularization strategy, the 
30 training brains from BRATS 2013 are too few to prevent 
overfitting. But as will be shown in the results section, our 
model generalizes well and thus do not overfit. One reason for 
this is the fact that each brain comes with 200 2d slices and 
thus, our model has approximately 6000 2D images to train on. 
We shall also mention that by their very nature, MRI images 
of brains are very similar from one patient to another. Since 
the variety of those images is much lower than those in real- 
image datasets such as CIFAR and ImageNet, a fewer number 
of training samples is thus needed. 

Cascaded Architectures. To train a cascaded architecture, we 
start by training the TwoPathCNN with the two phase stochas¬ 
tic gradient descent procedure described previously. Then, we 
fix the parameters of the TwoPathCNN and include it in the 
cascaded architecture (be it the InputCascadeCNN, the Local- 
CascadeCNN, or the MFCascadeCNN) and move to training 
the remaining parameters using a similar procedure. It should 
be noticed however that for the spatial size of the first CNN’s 
output and the layer of the second CNN to match, we must feed 
to the first CNN a much larger input. Thus, training of the sec¬ 
ond CNN must be performed on larger patches. For example 
in the InputCascadeCNN (Figure [^, the input size to the first 
model is of size 65 x 65 which results into an output of size 
33 X 33. Only in this case the outputs of the first CNN can be 
concatenated with the input channels of the second CNN. 

4. Implementation details 

Our implementation is based on the Pylearn2 library lEl. 
Pylearn2 is an open-source machine learning library special¬ 
izing in deep learning algorithms. It also supports the use of 
GPUs, which can greatly accelerate the execution of deep learn¬ 
ing algorithms. 

Since CNN’s are able to learn useful features from scratch, 
we applied only minimal pre-processing. We employed the 
same pre-processing as Tustison et al., the winner of the 2013 
BRATS challenge (321. The pre-processing follows three steps. 
First, the 1 % highest and lowest intensities are removed. Then, 
we apply an N4ITK bias correction m to T1 and TIC modali¬ 
ties. The data is then normalized within each input channel by 
subtracting the channel’s mean and dividing by the channel’s 
standard deviation. 

As for post-processing, a simple method based on connected 
components was implemented to remove fiat blobs which might 
appear in the predictions due to bright corners of the brains 
close to the skull. 

The hyper-parameters of the different architectures (kernel 
and max pooling size for each layer and the number of layers) 
can be seen in Figure Hyper-parameters were tuned using 


grid search and cross-validation on a validation set (see Bengio 
ID)- The chosen hyper-parameters were the ones for which the 
model performed best on the validation set. For max pooling, 
we always use a stride of 1. This is to keep per-pixel accuracy 
during full image prediction. We observed in practice that max 
pooling in the global path does not improve accuracy. We also 
found that adding additional layers to the architectures or in¬ 
creasing the capacity of the model by adding additional feature 
maps to the convolutional blocks do not provide any meaning¬ 
ful performance improvement. 

Biases are initialized to zero except for the softmax layer for 
which we initialized them to the log of the label frequencies. 
The kernels are randomly initialized from U (-0.005,0.005). 
Training takes about 3 minutes per epoch for the TwoPathCNN 
model on an NVIDIA Titan black card. 

At test time, we run our code on a GPU in order to exploit 
its computational speed. Moreover, the convolutional nature of 
the output layer allows us to further accelerate computations at 
test time. This is done by feeding as input a full image and not 
individual patches. Therefore, convolutions at all layers can be 
extended to obtain all label probabilities p{Yij\S.) for the entire 
image. With this implementation, we are able to produce a seg¬ 
mentation in 25 seconds per brain on the Titan black card with 
the TwoPathCNN model. This turns out to be 45 times faster 
than when we extracted a patch at each pixel and processed 
them individually for the entire brain. 

Predictions for the MFCascadeCNN model, the LocalCas- 
cadeCNN model, and InputCascadeCNN model take on aver¬ 
age 1.5 minutes, 1.7 minutes and 3 minutes respectively. 

5. Experiments and Results 

The experiments were carried out on real patient data ob¬ 
tained from the 2013 brain tumor segmentation challenge 
(BRATS2013), as part of the MICCAI conference (13. The 
BRATS2013 dataset is comprised of 3 sub-datasets. The train¬ 
ing dataset, which contains 30 patient subjects all with pixel- 
accurate ground truth (20 high grade and 10 low grade tumors); 
the test dataset which contains 10 (all high grade tumors) and 
the leaderboard dataset which contains 25 patient subjects (21 
high grade and 4 low grade tumors). There is no ground truth 
provided for the test and leaderboard datasets. All brains in 
the dataset have the same orientation. For each brain there ex¬ 
ists 4 modalities, namely Tl, TIC, T2 and Flair which are co¬ 
registered. The training brains come with groundtruth for which 
5 segmentation labels are provided, namely non-tumor, necro¬ 
sis, edema, non-enhancing tumor and enhancing tumor. Fig¬ 
ure]^ shows an example of the data as well as the ground truth. 
In total, the model iterates over about 2.2 million examples of 
tumorous patches (this consists of all the 4 sub-tumor classes) 
and goes through 3.2 million of the healthy patches. As men¬ 
tioned before during the first phase training, the distribution of 
examples introduced to the model from all 5 classes is uniform. 

Please note that we could not use the BRATS 2014 dataset 
due to problems with both the system performing the evalu¬ 
ation and the quality of the labeled data. For these reasons 
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Figure 4: The first four images from left to right show the MRI 
modalities used as input channels to various CNN models and 
the fifth image shows the ground truth labels where ■ edema, 
■ enhanced tumor, ■ necrosis, ■ non-enhanced tumor. 


the old BRATS 2014 dataset has been removed from the of¬ 
ficial website and, at the time of submitting this manuscript, 
the BRATS website still showed: “Final data for BRATS 2014 
to be released soon”. Furthermore, we have even conducted 
an experiment where we trained our model with the old 2014 
dataset and made predictions on the 2013 test dataset; however, 
the performance was worse than our results mentioned in this 
paper. For these reasons, we decided to focus on the BRATS 
2013 data. 

As mentioned in Section we work with 2D slices due to 
the fact that the MRI volumes in the dataset do not posses an 
isotropic resolution and the spacing in the third dimension is 
not consistent across the data. We explored the use of 3D infor¬ 
mation (by treating the third dimension as extra input channels 
or by having an architecture which takes orthogonal slices from 
each view and makes the prediction on the intersecting cen¬ 
ter pixel), but that didn’t improve performance and made our 
method very slow. 

Note that as suggested by Krizhevsky et al. 1771 . we applied 
data augmentation by flipping the input images. Unlike what 
was reported by Zeiler and Fergus ill, it did not improve the 
overall accuracy of our model. 

Quantitative evaluation of the models performance on the test 
set is achieved by uploading the segmentation results to the on¬ 
line BRATS evaluation system ifT^ . The online system pro¬ 
vides the quantitative results as follows: The tumor structures 
are grouped in 3 different tumor regions. This is mainly due 
to practical clinical applications. As described by Menze et al. 
EH, tumor regions are defined as: 

a) The complete tumor region (including all four tumor 
structures). 

b) The core tumor region (including all tumor structures 
exept “edema”). 

c) The enhancing tumor region (including the “enhanced 
tumor” structure). 

For each tumor region. Dice (identical to F measure). Sensi¬ 
tivity and Specificity are computed as follows : 

Dice(P, T) 

S ensitivity(P, T) 

S pecificity(P, T) 


1/^1 A Til 
(|Fi| + |ri|)/2’ 
1/^1 A Til 

17 ^ 11 ’ 

17^0 A TqI 
IT^ol ’ 


where P represents the model predictions and T represents 
the ground truth labels. We also note as Ti and To the sub¬ 
set of voxels predicted as positives and negatives for the tumor 
region in question. Similarly for Pi and Pq. The online evalua¬ 
tion system also provides a ranking for every method submitted 
for evaluation. This includes methods from the 2013 BRATS 
challenge published in 1 ^ as well as anonymized unpublished 
methods for which no reference is available. In this section, we 
report experimental results for our different CNN architectures. 

5.7. The TwoPathCNN architecture 

As mentioned previously, unlike conventional CNNs, the 
TwoPathCNN architecture has two pathways: a “local” path 
focusing on details and a “global” path more focused on the 
context. To better understand how joint training of the global 
and local pathways benefits the performance, we report results 
on each pathway as well as results on averaging the outputs of 
each pathway when trained separately. Our method also deals 
with the unbalanced nature of the problem by training in two 
phases as discussed in Section |3.2[ To see the impact of the 
two phase training, we report results with and without it. We 
refer to the CNN model consisting of only the local path (i.e. 
conventional CNN architecture) as LocalPathCNN, the CNN 
model consisting of only the global path as GlobalPathCNN, 
the model averaging the outputs of the local and global paths 
(i.e. LocalPathCNN and GlobalPathCNN) as AverageCNN 
and the two-pathway CNN architecture as TwoPathCNN. The 
second training phase is noted by appending to the archi¬ 
tecture name. Since the second phase training has a substantial 
effect and always improves the performance, we only report re¬ 
sults on GlobalPathCNN and AverageCNN with the second 
phase. 

Table presents the quantitative results of these variations. 
This table contains results for the TwoPathCNN with one and 
two training phases, the common single path CNN (i.e. Lo¬ 
calPathCNN) with one and two training phases, the Global¬ 
PathCNN* which is a single path CNN model following the 
global pathway architecture and the output average of each of 
the trained single-pathway models (AverageCNN*). Without 
much surprise, the single path with one training phase CNN 
was ranked last with the lowest scores on almost every region. 
Using a second training phase gave a significant boost to that 
model with a rank that went from 15 to 9. Also, the table shows 
that joint training of the local and global paths yields better per¬ 
formance compared to when each pathway is trained separately 
and the outputs are averaged. One likely explanation is that by 
joint training the local and global paths, the model allows the 
two pathways to co-adapt. In fact, the AverageCNN* performs 
worse than the LocalPathCNN* due to the fact that the Glob¬ 
alPathCNN* performs very badly. The top performing method 
in the uncascaded models is the TwoPathCNN* with a rank of 
4. 

Also, in some cases results are less accurate over the En¬ 
hancing region than for the Core and Complete regions. There 
are 2 main reasons for that. First, borders are usually diffused 
and there are no clear cut between enhanced tumor and non- 
enhanced tissues. This creates problems for both user labeling. 
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Figure 5: Randomly selected filters from the first layer of the 
model. From left to right the figure shows visualization of fea¬ 
tures from the first layer of the global and local path respec¬ 
tively. Features in the local path include more edge detectors 
while the global path contains more localized features. 


ground truth, as well as the model. The second reason is that 
the model learns what it sees in the ground truth. Since the la¬ 
bels are created by different people and since the borders are 
not clear, each user has a slightly different interpretation of the 
borders of the enhanced tumor and so sometimes we see overly 
thick enhanced tumor in the ground truth. 

Figure shows representation of low level features in both 
local and global paths. As seen from this figure, features in the 
local path include more edge detectors while the ones in the 
global path are more localized features. Unfortunately, visual¬ 
izing the learned mid/high level features of a CNN is still very 
much an open research problem. However, we can study the im¬ 
pact these features have on predictions by visualizing the seg¬ 
mentation results of different models. The segmentation results 
on two subjects from our validation set, produced by different 
variations of the basic model can be viewed in Figure 1 ’ As 
shown in the figure, the two-phase training procedure allows 
the model to learn from a more realistic distribution of labels 
and thus removes false positives produced by the model which 
trains with one training phase. Moreover, by having two path¬ 
ways, the model can simultaneously learn the global contextual 
features as well as the local detailed features. This gives the 
advantage of correcting labels at a global scale as well as rec¬ 
ognizing fine details of the tumor at a local scale, yielding a bet¬ 
ter segmentation as oppose to a single path architecture which 
results in smoother boundaries. Joint training of the two con¬ 
volutional pathways and having two training phases achieves 
better results. 


5.2. Cascaded architectures 

We now discuss our experiments with the three cascaded 
architectures namely InputCascadeCNN, LocalCascadeCNN 
and MFCascadeCNN. Tablej^ provides the quantitative results 
for each architecture. Figure^also provides visual examples of 
the segmentation generated by each architecture. 

We find that the MFCascadeCNN* model yields smoother 
boundaries between classes. We hypothesize that, since the 


^It is important to note that we do not train the model on the validation set 
and thus the quality of the results is not due to overfitting 



Figure 8: Visual results from our top performing model, In¬ 
putCascadeCNN* on Coronal and Sagittal views. The subjects 
are the same as in Figure |7] In every sub-figure, the top row 
represents the Sagital view and the bottom row represents the 
Coronal view. The color codes are as follows: ■ edema, ■ 
enhanced tumor, ■ necrosis, ■ non-enhanced tumor. 


neurons in the softmax output layer are directly connected to 
the previous outputs within each receptive field, these parame¬ 
ters are more likely to learn that the center pixel label should 
have a similar label to its surroundings. 

As for the LocalCascadeCNN* architecture, while it resulted 
in fewer false positives in the complete tumor category, the per¬ 
formance in other categories (i.e. tumor core and enhanced tu¬ 
mor) did not improve. 

Figure shows segmentation results from the same brains 
(as in Figure I?]) in Sagittal and Coronal views. The InputCas¬ 
cadeCNN* model was used to produce these results. As seen 
from this figure, although the segmentation is performed on 
Axial view but the output is consistent in Coronal and Sagittal 
views. Although subjects in Figure 5 and Figure 6 are from our 
validation set for which the model is not trained on and the seg¬ 
mentation results from these subjects can give a good estimate 
of the models performance on a test set, however, for further 
clarity we visualise the models performance on two subjects 
from BRATS-2013 testst. These results are shown in Figure]^ 
in Saggital (top) and Axial (bottom) views. 

To better understand the process for which InputCas¬ 
cadeCNN* learns features, we present in Figurethe progres¬ 
sion of the model by making predictions at every few epochs on 
a subject from our validation set. 

Overall, the best performance is reached by the InputCas¬ 
cadeCNN* model. It improves the Dice measure on all tu¬ 
mor regions. With this architecture, we were able to reach 
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Figure 6: Progression of learning in InputCascadeCNN*. The stream of figures on the first row from left to right show the learning 
process during the first phase. As the model learns better features, it can better distinguish boundaries between tumor sub-classes. 
This is made possible due to uniform label distribution of patches during the first phase training which makes the model believe 
all classes are equiprobable and causes some false positives. This drawback is alleviated by training a second phase (shown in 
second row from left to right) on a distribution closer to the true distribution of labels. The color codes are as follows: ■ edema, ■ 
enhanced tumor, ■ necrosis, ■ non-enhanced tumor. 


Table 1: Performance of the TwoPathCNN model and variations. The second phase training is noted by appending to the 
architecture name. The ‘Rank’ column represents the ranking of each method in the online score board at the time of submission. 


Rank 

Method 


Dice 



Specificity 


Sensitivity 

Complete 

Core 

Enhancing 

Complete 

Core 

Enhancing 

Complete 

Core 

Enhancing 

4 

TwoPathCNN* 

0.85 

0.78 

0.73 

0.93 

0.80 

0.72 

0.80 

0.76 

0.75 

9 

LocalPathCNN* 

0.85 

0.74 

0.71 

0.91 

0.75 

0.71 

0.80 

0.77 

0.73 

10 

AverageCNN* 

0.84 

0.75 

0.70 

0.95 

0.83 

0.73 

0.77 

0.74 

0.73 

14 

GlobalPathCNN* 

0.82 

0.73 

0.68 

0.93 

0.81 

0.70 

0.75 

0.65 

0.70 

14 

TwoPathCNN 

0.78 

0.63 

0.68 

0.67 

0.50 

0.59 

0.96 

0.89 

0.82 

15 

LocalPathCNN 

0.77 

0.64 

0.68 

0.65 

0.52 

0.60 

0.96 

0.87 

0.80 
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Figure 7: Visual results from our CNN architectures from the Axial view. For each sub-figure, the top row from left to right 
shows TIC modality, the conventional one path CNN, the Conventional CNN with two training phases, and the TwoPathCNN 
model. The second row from left to right shows the ground truth, LocalCascadeCNN model, the MFCascadeCNN model and the 
InputCascadeCNN. The color codes are as follows: ■ edema, ■ enhanced tumor, ■ necrosis, ■ non-enhanced tumor. 
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Table 2: Performance of the cascaded architectures. The reported results are from the second phase training. The ‘Rank’ column 
shows the ranking of each method in the online score board at the time of submission. 


Rank 

Method 


Dice 


Specificity 

Sensitivity 

Complete 

Core 

Enhancing 

Complete 

Core 

Enhancing 

Complete 

Core 

Enhancing 

2 

InputCascadeCNN* 

0.88 

0.79 

0.73 

0.89 

0.79 

0.68 

0.87 

0.79 

0.80 

4-a 

MFCascadeCNN* 

0.86 

0.77 

0.73 

0.92 

0.80 

0.71 

0.81 

0.76 

0.76 

4-c 

LocalCascadeCNN* 

0.88 

0.76 

0.72 

0.91 

0.76 

0.70 

0.84 

0.80 

0.75 



Figure 9: Visual segmentation results from our top performing 
model, InputCascadeCNN*, on examples of the BRATS2013 
test dataset in Saggital (top) and Axial (bottom) views. The 
color codes are as follows: ■ edema, ■ enhanced tumor, ■ 
necrosis, ■ non-enhanced tumor. 


the second rank on the BRATS 2013 scoreboard. While MF- 
CascadeCNN*, TwoPathCNN* and LocalCascadeCNN* are 
all ranked 4, the inner ranking between these three models is 
noted as 4a, 4b and 4c respectively. 

Table shows how our implemented architectures compare 
with currently published state-of-the-art methods as mentioned 
in 13210 The table shows that InputCascadeCNN* out per¬ 
forms Tustison et al. the winner of the BRATS 2013 challenge 
and is ranked first in the table. Results from the BRATS-2013 
leaderboard presented in Tableshows that our method outper¬ 
forms other approaches on this dataset. We also compare our 
top performing method in Tablej^with state-of-the-art methods 
on BRATS-2012, ”4 label” test set as mentioned in l3^ . As 
seen from this table, our method out performs other methods in 
the tumor Core category and gets competitive results on other 
categories. 

Let us mention that Tustison’s method takes 100 minutes to 
compute predictions per brain as reported in |32l, while the 
InputCascadeCNN* takes 3 minutes, thanks to the fully con¬ 
volutional architecture and the GPU implementation, which is 
over 30 times faster than the winner of the challenge. The 
TwoPathCNN* has a performance close to the state-of-the-art. 
However, with a prediction time of 25 seconds, it is over 200 
times faster than Tustison’s method. Other top methods in the 
table are that of Meier et al and Reza et al with processing 
times of 6 and 90 minutes respectively. Recently Subbanna 
et al. 1431 published competitive results on the BRATS 2013 
dataset, reporting dice measures of 0.86,0.86,0.77 for Com¬ 
plete, Core and Enhancing tumor regions. Since they do not 
report Specificity and Sensitivity measures, a completely fair 
comparison with that method is not possible. However, as men¬ 
tioned in m, their method takes 70 minutes to process a sub¬ 
ject, which is about 23 times slower than our method. 

Regarding other methods using CNNs, Urban et al. EH used 
an average of two 3D convolutional networks with dice mea¬ 
sures of 0.87,0.77,0.73 for Complete, Core and Enhancing tu¬ 
mor regions on BRATS 2013 test dataset with a prediction time 
of about 1 minute per model which makes for a total of 2 min¬ 
utes. Again, since they do not report Specificity and Sensitiv¬ 
ity measures, we can not make a full comparison. However, 
based on their dice scores our TwoPathCNN* is similar in per- 


^Please note that the results mentioned in Tablej^and Tableware from meth¬ 
ods competing in the BRATS 2013 challenge for which a static table is provided 
[https://www.virtualskeleton.ch/BRATS/StaticResults2013]. Since then, other 
methods have been added to the score board but for which no reference is avail¬ 
able. 
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Table 5: Comparison of our top implemented architectures with 
the state-of-the-art methods on the BRATS-2012 ”4 label” test 
set as discussed in EH. 


Method 


Diee 


Complete 

Core 

Enhaneing 

InputCascadeCNN* 

0.81 

0.72 

0.58 

Subbanna 

0.75 

0.70 

0.59 

Zhao 

0.82 

0.66 

0.42 

Tustison 

0.75 

0.55 

0.52 

Festa 

0.62 

0.50 

0.61 


formance while taking only 25 seconds, which is four times 
faster. And the InputCascadeCNN* is better or equal in accu¬ 
racy while having the same processing time. As for ia, they 
do not report results on BRATS 2013 test dataset. However, 
their method is very similar to the LocalPathCNN which, ac¬ 
cording to our experiments, has worse performance. 

Using our best performing method, we took part in the 
BRATS 2015 challenge. The BRATS 2015 training dataset 
comprises of 220 subjects with high grade and 54 subjects with 
low grade gliomas. There are 53 subjects with mixed high and 
low grade gliomas for testing. Every participating group had 
48 hours from receiving the test subjects to process them and 
submit their segmentation results to the online evaluation sys¬ 
tem. BRATS’ 15 contains the training data of 2013. The ground 
truth for the rest of the training brains is generated by a voted 
average of segmented results of the top performing methods in 
BRATS’13 and BRATS’12. Some of these automatically gen¬ 
erated ground truths have been refined manually by a user. 

Because distribution of the intensity values in this dataset is 
very variable from one subject to another, we used a 7 fold cross 
validation for training. At test time, a voted average of these 
models was made to make prediction for each subject in the test 
dataset. The results of the challenge are presented in Figure 
The semi-automatic methods participating in the challenge have 
been highlighted in grey. Please note since these results are not 
yet publicly available, we refrain from disclosing the name of 
the participants. In this figure the semi-automatic methods are 
highlighted in gray. As seen from the figure, our method ranks 
either first or second on Complete tumor and tumor Core cat¬ 
egories and gets competitive results on active tumor category. 
Our method has also less outliers than most other approaches. 
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Figure 10: Our BRATS’15 challenge results using InputCas- 
cadeCNN*. Dice scores and negative log Hausdorff distances 
are presented for the three tumor categories. Since the results 
of the challenge are not yet publicly available, we are unable 
to disclose the name of the participants. The semi-automatic 
methods are highlighted in gray. In each sub-figure, the meth¬ 
ods are ranked based on the mean value. The mean is presented 
in green, the median in red and outliers in blue. 


2013. The high performance is achieved with the help of a novel 
two-pathway architecture (which can model both the local de¬ 
tails and global context) as well as modeling local label depen¬ 
dencies by stacking two CNN’s. Training is based on a two 
phase procedure, which we’ve found allows us to train CNNs 
efficiently when the distribution of labels is unbalanced. 

Thanks to the convolutional nature of the models and by us¬ 
ing an efficient GPU implementation, the resulting segmenta¬ 
tion system is very fast. The time needed to segment an entire 
brain with any of the these CNN architectures varies between 
25 seconds and 3 minutes, making them practical segmentation 
methods. 


6. Conclusion 

In this paper, we presented an automatic brain tumor segmen¬ 
tation method based on deep convolutional neural networks. 
We considered different architectures and investigated their im¬ 
pact on the performance. Results from the BRATS 2013 on¬ 
line evaluation system confirms that with our best model we 
managed to improve on the currently published state-of-the-art 
method both on accuracy and speed as presented in MICCAI 
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